This document presents the primary analyses of the VIBRANT trial.
Currently, the analyses run on simulated data, but the code is mostly ready to be run on the real data once they are available.
The simulated data are currently relatively realistic (though they still need some improvement) and very optimistic (they assume a very successful trial).
Until we have the actual data, we intend to improve the simulated data (i.e., make them more realistic) as follows:
Model the metagenomic data from an actual file provided by Jacques’ lab (currently, we “invented” the format, but it would be much better to have an actual pilot example).
Add demographic variables
Add dropouts (participants who left the study before the endpoint), missed visits or missed data (failed assay or uncollected sample)
Add replacements
Add differences in LBP strain survival rates between sites, arms, and participants
Add inter-individual variability in how long the LBP strains survive
Add bacterial species (currently, we only have the 15 LBP strains, “other” L. crispatus, L. iners, and “non-Lacto” (lumped together))
(add contamination)
Consequently, some (supplementary) analyses are still missing:
Demographics (Table 1)
Sensitivity analyses for the different populations (ITT, mITT, PP)
… + search for “TODO” or “XXX” within the document (placeholders for analyses that still need to be implemented or text that needs to be written)
These will be implemented once the corresponding improvements to the simulated data detailed above are in place.
Even with all these improvements, there will inevitably be unanticipated issues with the real data, but the core of the methods (and the corresponding code) will likely remain unchanged.
Workflow
The document is organized as follows:
Data preparation: load the raw data as provided by labs/teams, and collate them into a neat MultiAssayExperiment object containing all the data linked by participant x visit.
Checks, QC, and Outcomes Definitions: check the data for potential issues, perform quality controls (e.g., check library sizes, potential contamination, etc.), and define outcomes of interest (colonization status, etc.)
Analyses: perform the primary analyses of the trial, including demographics, primary and secondary outcomes, and sensitivity analyses. This section has the main tables and figures.
Data preparation
Loading the data and creating SummarizedExperiment objects for each assay
In this section, we load the raw data from their original files (i.e., the data/files as generated by the corresponding labs or teams) and transform each assay into a SummarizedExperiment object. This will then enable us to collect them into a single MultiAssayExperiment object and to perform the integrative analyses.
To link the assays together, we define a common unique identifier for each participant x visit. This unique identifier (.sample) is the concatenation of the pid (participant ID) and the visit_code.
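As a minimal sketch of this step, the identifier can be built with a simple paste(); the example values and the "_" separator are assumptions for illustration:

```r
# Toy participant-x-visit table, with pid and visit_code as described above
samples <- data.frame(
  pid        = c("P001", "P001", "P002"),
  visit_code = c("V01", "V02", "V01")
)

# .sample = concatenation of pid and visit_code
# (the "_" separator is an assumption for illustration)
samples$.sample <- paste(samples$pid, samples$visit_code, sep = "_")
```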
Clinical and survey (CRF) data
Information about the data
The data acquisition process is described XXXHEREXXXX.
The data encoding in a xxx database and the quality control processes are described XXXHEREXXX.
The data was exported from the database by Lara Lewis (CAPRISA, South Africa) as a set of xxx files.
Lara Wautier (UCLouvain, Belgium) then imported these files into R and assembled the data into two tables: a participants table and a visits table. The code is available XXXHEREXXX.
The participants table contains participant-invariant information such as the participant ID, the treatment arm, site, and study population (ITT, mITT, PP). The visits table contains participant-varying information collected at each visit.
The columns necessary for the primary trial analyses were then selected from these two tables and exported into two files: participants.RDS and visits.RDS, which we load below.
Loading the data
We load the participants and visits data and create:
one table which will become the MAE’s colData
one SE object which provides attendance information at each visit (this also ensures that the MAE object has data for all visits, which is useful for downstream analyses)
A MultiAssayExperiment object of 3 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 3:
[1] visit_attendance: SummarizedExperiment with 2 rows and 4600 columns
[2] mg_ksanity: SummarizedExperiment with 16 rows and 900 columns
[3] amplicon_ASV_all: SummarizedExperiment with 3 rows and 1000 columns
Functionality:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample coordination DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
exportClass() - save data to flat files
Checks, QC, and Outcomes Definitions
Checks and Quality Controls
Visualization of available data for each assay and each site x arm x participant x visit
available_data |>
  ggplot(aes(x = assay, y = pid, col = assay)) +
  geom_point() +
  facet_grid(arm + site ~ visit_code, scales = "free", space = "free") +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
    legend.position = "none"
  )
Checking for contamination
While it is possible for a few Placebo participants to natively carry strains that are extremely similar to the LBP strains, it is highly unlikely for many of the LBP strains to be present simultaneously in a single sample from a Placebo participant. Such an event would much more likely result from contamination, a mislabeling of the sample, or a randomization issue.
We thus check how many LBP strains are simultaneously present in the Placebo participants' samples.
TODO
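A possible sketch of this check, assuming a presence/absence matrix of the 15 LBP strains (rows) across the Placebo samples (columns); the simulated matrix, the sample names, and the co-occurrence threshold are all invented for illustration:

```r
# Simulated presence/absence matrix: 15 LBP strains x 6 Placebo samples
set.seed(1)
lbp_presence <- matrix(
  rbinom(15 * 6, size = 1, prob = 0.05), nrow = 15,
  dimnames = list(paste0("strain_", 1:15), paste0("placebo_", 1:6))
)

# Number of LBP strains detected simultaneously in each Placebo sample
n_strains <- colSums(lbp_presence)

# Flag samples where several strains co-occur
# (the threshold of 3 strains is an assumption for illustration)
suspect_samples <- names(n_strains)[n_strains >= 3]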
Other checks?
Outcomes definitions
Primary outcome definition
The primary outcome is XXX INSERT DEFINITION XXX
Consequently, we need to compute the relative abundance of each LBP strain from two quantities: its proportion out of total L. crispatus, and the total L. crispatus relative abundance as estimated by 16S sequencing.
From this, we can compute the total relative abundance of all LBP strains or the maximum relative abundance of any LBP strain in each sample.
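The computation described above amounts to a simple product; the vectors below (strain names and values) are invented for illustration:

```r
# Proportion of each LBP strain out of total L. crispatus (metagenomics);
# strain names and values are assumptions for illustration
strain_prop_within_crispatus <- c(strain_1 = 0.30, strain_2 = 0.10, strain_3 = 0.05)

# Total L. crispatus relative abundance in the sample, as estimated by 16S
crispatus_rel_ab_16S <- 0.40

# Relative abundance of each LBP strain in the whole community
strain_rel_ab <- strain_prop_within_crispatus * crispatus_rel_ab_16S

total_lbp_rel_ab <- sum(strain_rel_ab)  # total LBP relative abundance
max_lbp_rel_ab   <- max(strain_rel_ab)  # maximum single-strain relative abundance
```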
If the total relative abundance of all LBP strains is larger than 0.1 or the maximum relative abundance of any LBP strain is larger than 0.05, we consider that the LBP achieved colonization at that visit.
From the colonization status at each visit, we can compute the colonization status by each visit. A participant is considered to have colonized by a visit if they have been colonized at that visit or any previous visit.
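These two definitions can be sketched on toy data as follows (the column names and values are assumptions; the 0.1 and 0.05 thresholds are from the text):

```r
# Toy per-visit relative abundances for two participants
visits <- data.frame(
  pid          = rep(c("P001", "P002"), each = 3),
  visit        = rep(1:3, times = 2),
  total_rel_ab = c(0.02, 0.15, 0.01, 0.00, 0.03, 0.04),
  max_rel_ab   = c(0.01, 0.08, 0.01, 0.00, 0.02, 0.06)
)

# Colonized *at* a visit: total rel. abundance > 0.1 or maximum > 0.05
visits$colonized_at <- visits$total_rel_ab > 0.1 | visits$max_rel_ab > 0.05

# Colonized *by* a visit: colonized at that visit or any previous one
visits <- visits[order(visits$pid, visits$visit), ]
visits$colonized_by <- ave(
  visits$colonized_at, visits$pid,
  FUN = function(x) cummax(x) == 1
)
```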
16S rRNA microbiota composition, aggregated at the species level
TODO
16S rRNA microbiota composition, augmented with LBP strains relative abundances
We combine the 16S rRNA microbiota composition with the LBP strains relative abundances to create a new assay mb_composition.
TODO
We now have an “augmented” MAE object with the following assays:
Code
mae
A MultiAssayExperiment object of 5 listed
experiments with user-defined names and respective classes.
Containing an ExperimentList class object of length 5:
[1] visit_attendance: SummarizedExperiment with 2 rows and 4600 columns
[2] mg_ksanity: SummarizedExperiment with 16 rows and 900 columns
[3] amplicon_ASV_all: SummarizedExperiment with 3 rows and 1000 columns
[4] LBP_rel_ab: SummarizedExperiment with 15 rows and 900 columns
[5] LBP_colonization: SummarizedExperiment with 4 rows and 900 columns
Functionality:
experiments() - obtain the ExperimentList instance
colData() - the primary/phenotype DataFrame
sampleMap() - the sample coordination DataFrame
`$`, `[`, `[[` - extract colData columns, subset, or experiment
*Format() - convert into a long or wide DataFrame
assays() - convert ExperimentList to a SimpleList of matrices
exportClass() - save data to flat files
If we have 0 successes in the Placebo arm, we won't be able to use a logistic regression model that accounts for site differences: this is a “perfect separation” issue, where the estimated Placebo log-odds diverge and the fitted odds ratios of the other arms versus Placebo are infinite:
Code
fit <- glm(colonized_by ~ arm + site + arm:site, data = col_by_week5, family = binomial)
fit |> summary()
Call:
glm(formula = colonized_by ~ arm + site + arm:site, family = binomial,
data = col_by_week5)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.857e+01 2.063e+03 -0.009 0.993
armLC-106-7 1.941e+01 2.063e+03 0.009 0.992
armLC-106-3 1.857e+01 2.063e+03 0.009 0.993
armLC-106-o 2.076e+01 2.063e+03 0.010 0.992
armLC-115 1.995e+01 2.063e+03 0.010 0.992
siteMGH 9.849e-11 2.917e+03 0.000 1.000
armLC-106-7:siteMGH 1.350e+00 2.917e+03 0.000 1.000
armLC-106-3:siteMGH -8.473e-01 2.917e+03 0.000 1.000
armLC-106-o:siteMGH -1.350e+00 2.917e+03 0.000 1.000
armLC-115:siteMGH -9.848e-11 2.917e+03 0.000 1.000
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 137.186 on 99 degrees of freedom
Residual deviance: 83.534 on 90 degrees of freedom
AIC: 103.53
Number of Fisher Scoring iterations: 17
There are a few ways to remedy that:
One way is to use a “pseudo-Bayesian” approach and add 1 artificial success to each arm x site combination, but this introduces a small unfavorable bias in the odds ratio estimates (and theoretically increases the p-values).
Code
tmp <- col_by_week5 |>
  bind_rows(
    col_by_week5 |>
      select(arm, site) |>
      distinct() |>
      mutate(colonized_by = TRUE)
  )
fit <- glm(colonized_by ~ arm + site + arm:site, data = tmp, family = binomial)
fit |> summary()
Call:
glm(formula = colonized_by ~ arm + site + arm:site, family = binomial,
data = tmp)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.303e+00 1.049e+00 -2.196 0.02810 *
armLC-106-7 3.283e+00 1.248e+00 2.631 0.00852 **
armLC-106-3 2.485e+00 1.211e+00 2.052 0.04015 *
armLC-106-o 4.605e+00 1.483e+00 3.105 0.00190 **
armLC-115 3.807e+00 1.308e+00 2.910 0.00361 **
siteMGH -7.406e-16 1.483e+00 0.000 1.00000
armLC-106-7:siteMGH 1.322e+00 1.938e+00 0.682 0.49529
armLC-106-3:siteMGH -7.419e-01 1.720e+00 -0.431 0.66622
armLC-106-o:siteMGH -1.322e+00 1.938e+00 -0.682 0.49529
armLC-115:siteMGH 5.634e-16 1.850e+00 0.000 1.00000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 148.06 on 109 degrees of freedom
Residual deviance: 103.03 on 100 degrees of freedom
AIC: 123.03
Number of Fisher Scoring iterations: 4
Another way is to use an actual Bayesian model that accounts for site differences, but this does not provide p-values, only posterior distributions (and credible intervals).
Code
col_by_week5_agg <- col_by_week5 |>
  group_by(site, arm) |>
  summarise(success = colonized_by |> sum(), ntrials = n())
bfit <- brm(
  success | trials(ntrials) ~ arm + site + arm:site,
  data = col_by_week5_agg,
  family = binomial("logit")
)
Alternatively, we can use exact Fisher tests (as initially proposed by Lara Lewis, CAPRISA) for each arm x site combination, correcting for multiple testing, but this does not allow us to directly test for site differences.
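A sketch of this approach on invented counts (the arm labels are taken from the outputs above, but the site labels, success counts, the comparison of each LBP arm against Placebo, and the BH correction are all assumptions for illustration):

```r
# One test per arm x site: LBP-arm successes vs Placebo successes at that site
tests <- expand.grid(arm = c("LC-106-7", "LC-115"), site = c("site_A", "site_B"))
tests$success_arm     <- c(8, 9, 7, 10)  # colonized participants in the LBP arm (invented)
tests$n_arm           <- 10              # participants per arm x site (invented)
tests$success_placebo <- 0               # Placebo successes at that site (invented)
tests$n_placebo       <- 10

tests$p_value <- mapply(
  function(sa, na, sp, np) {
    tab <- matrix(c(sa, na - sa, sp, np - sp), nrow = 2)
    fisher.test(tab)$p.value
  },
  tests$success_arm, tests$n_arm, tests$success_placebo, tests$n_placebo
)

# Correct for multiple testing across the arm x site tests
tests$p_adj <- p.adjust(tests$p_value, method = "BH")
```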